We present results on light hadron masses from simulations of full QCD and report on experiences in running
such simulations on a Hitachi SR8000-F1 supercomputer.



1. INTRODUCTION 

With the installation of a 112-node Hitachi
SR8000-F1 at the Leibniz-Rechenzentrum in Munich,
the QCDSF Collaboration has started simulating
QCD with two degenerate flavors of
fermions. In a cooperative program with the
UKQCD Collaboration (see the talk by Irving [1]) we
share configurations, giving both collaborations
the possibility to do measurements at a larger set
of bare parameters.

This contribution has two parts. In the first
part we report on the performance of QCD calculations
on the Hitachi SR8000-F1. In the second
part we present results on the light hadron mass
spectrum for dynamical, non-perturbatively $O(a)$
improved Wilson fermions. We focus on the comparison
with our quenched results.

2. QCD ON THE HITACHI SR8000-F1 

2.1. Hardware overview 

The SR8000 in Munich is a "cluster" of 112
symmetric multiprocessor nodes. Each node has
8 CPUs and (at least) 8 GByte of shared memory.
Each CPU has a 128 kByte 4-way associative
cache, 160 registers and a pseudo-vectorization
facility. The peak performance of a
CPU is 1.5 Gflop/s, giving a peak performance of
1344 Gflop/s for the whole machine.
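
The machine peak quoted above follows directly from the node count, the CPUs per node and the per-CPU peak. A minimal sketch of that arithmetic (the constant and function names are chosen here for illustration only):

```python
# Hardware figures quoted in the text for the SR8000-F1 in Munich.
NODES = 112
CPUS_PER_NODE = 8
PEAK_PER_CPU_GFLOPS = 1.5


def machine_peak_gflops(nodes=NODES, cpus_per_node=CPUS_PER_NODE,
                        peak_per_cpu=PEAK_PER_CPU_GFLOPS):
    """Aggregate peak performance in Gflop/s."""
    return nodes * cpus_per_node * peak_per_cpu


print(machine_peak_gflops())  # 1344.0
```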

2.2. Pseudo-vectorization 

In order to get good performance it is essential
to employ pseudo-vectorization. At the hardware
level pseudo-vectorization is implemented by efficient
pre-fetch and pre-load mechanisms. At the
programming level the compiler tries to vectorize
the innermost loops. These loops are vectorized
under the usual conditions, i.e., if there are no data
access conflicts, no function/subroutine calls and
no if-blocks in the loop body.

On the SR8000 one can basically use the same
code as on a standard RISC processor and, more
importantly, one can keep the same data layout as
for a RISC CPU. For example, our data layout is
(in Fortran 90 notation)

    complex(8) u(3, 3, vol/2, even:odd, dim)
    complex(8) a(4, 3, vol/2)
    complex(8) c(18, 2, vol/2, even:odd)

for the gauge, pseudo-fermion and clover fields,
respectively. In contrast to a real vector computer,
the vector index does not have to be shifted to the
first place.
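
To illustrate the remark on the index order, the following sketch computes Fortran (column-major) memory offsets for the pseudo-fermion array a(4, 3, vol/2). The helper name, the $8^4$ volume and the reading of the first two indices as spin and colour are assumptions made for this illustration; they are not taken from the production code.

```python
def column_major_offset(index, shape):
    """Zero-based linear offset of a 1-based Fortran index tuple.

    In Fortran the first index varies fastest in memory.
    """
    offset, stride = 0, 1
    for i, n in zip(index, shape):
        if not 1 <= i <= n:
            raise IndexError(f"index {i} outside 1..{n}")
        offset += (i - 1) * stride
        stride *= n
    return offset


# Hypothetical example: an 8^4 lattice, half of the sites (even or odd).
VOL_HALF = 8**4 // 2
SHAPE_A = (4, 3, VOL_HALF)  # assumed meaning: (spin, colour, site)

# The site index (the "vector index") is the slowest index, so
# neighbouring sites are 4 * 3 = 12 complex numbers apart in memory.
site_stride = (column_major_offset((1, 1, 2), SHAPE_A)
               - column_major_offset((1, 1, 1), SHAPE_A))
print(site_stride)  # 12
```

On a classical vector machine one would typically reorder the array so that the site index comes first (stride 1); the point made in the text is that this reordering is not needed on the SR8000.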

2.3. Single CPU performance 

To fix the notation, let $M$ be the fermion matrix,
$D$ the Wilson hopping term and $A$ the clover
term:

$M = A - \kappa D$    (1)

$A = 1 - \frac{i}{2} \kappa c_{sw} \sigma_{\mu\nu} F_{\mu\nu}$    (2)

Performance figures for the multiplication of $D$,
$A$, $A^{-1}$ and $M$ with a vector are given in Table 1.
These figures were obtained with Fortran 90 code
using 64 bit arithmetic on an $8^4$ lattice. The code
performs well in comparison with RISC processors,
where one typically sustains 10-20 % of the
peak performance.
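
The percentages in Table 1 are the measured rates divided by the per-CPU peak of 1.5 Gflop/s. A short sketch of this conversion, using the Mflop/s values from Table 1 (the dictionary keys and function name are illustrative):

```python
PEAK_MFLOPS = 1500.0  # 1.5 Gflop/s per CPU, as quoted in Section 2.1

# Sustained single-CPU rates from Table 1, in Mflop/s.
MEASURED_MFLOPS = {"D": 640.0, "A": 1160.0, "A^-1": 630.0, "M": 676.0}


def percent_of_peak(mflops, peak=PEAK_MFLOPS):
    """Sustained performance as a percentage of the peak."""
    return 100.0 * mflops / peak


for name, rate in MEASURED_MFLOPS.items():
    print(f"{name:5s} {rate:6.0f} Mflop/s  {percent_of_peak(rate):3.0f} % of peak")
```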

2.4. Parallelization 

Our code is parallelized with MPI. In addition
we have experimented with a hybrid programming
model using OpenMP for distributing the
work over the CPUs of a single node and MPI
for the communication between nodes. Figure 1
shows the communication loss for the fermion matrix
multiplication.

Figure 1. Communication loss for constant lattice
volume per CPU as indicated. Further explanations
are given in the text.

Table 1
Performance of the multiplication with a vector
on a single CPU.

D         640 Mflop/s    (43 % of peak)
A        1160 Mflop/s    (77 % of peak)
A^{-1}    630 Mflop/s    (42 % of peak)
M         676 Mflop/s    (45 % of peak)

The solid lines (identical in both plots)
show performance measurements for a lattice volume
of $8^4$ per CPU for the pure MPI code. In
the left plot these are compared with measurements
on volumes of $6^4$ per CPU. In the right plot
the comparison is with the hybrid programming
model, again on a CPU-local volume of $8^4$.

As can be seen from Figure 1, the communication
loss is quite substantial. The observations
on the pure MPI code can be understood
qualitatively in the following way. Using more and
more CPUs, the communication pattern becomes
more and more complex. Within one node, when
using 2, 4 or 8 CPUs there are 2, 4 or 6 boundaries
which have to be communicated, and the
surface/volume ratio is 2/8, 4/8 or 6/8, respectively.
Between nodes, communication is along 1, 2 or 3
dimensions (in our program the lattice can be
decomposed maximally in three dimensions) when
using 2, 4, or 8 (and more) nodes, leading again to
more and more complicated communication patterns.
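
The boundary counting above can be sketched as follows. The sketch assumes power-of-two process counts and that each doubling of the process count splits one further lattice direction, up to the maximum of three; the function names are illustrative and not taken from our program.

```python
MAX_DECOMPOSED_DIMS = 3  # the lattice is decomposed in at most 3 directions
FACES_OF_4D_BLOCK = 8    # 4 directions, a forward and a backward face each


def decomposed_dims(n_procs):
    """Number of split directions for a power-of-two process count."""
    if n_procs < 1 or n_procs & (n_procs - 1):
        raise ValueError("n_procs must be a positive power of two")
    return min(n_procs.bit_length() - 1, MAX_DECOMPOSED_DIMS)


def boundaries(n_procs):
    """Faces of the local 4D block that have to be communicated."""
    return 2 * decomposed_dims(n_procs)


for n in (2, 4, 8):
    print(f"{n} CPUs: {boundaries(n)} boundaries, "
          f"surface/volume ratio {boundaries(n)}/{FACES_OF_4D_BLOCK}")
```

For 2, 4 and 8 CPUs this reproduces the 2, 4 and 6 boundaries and the ratios 2/8, 4/8 and 6/8 quoted in the text.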

On a single node, parallelization with OpenMP
is more efficient than with MPI. This is to be
expected because no communication is needed
in that case. Surprisingly, however, this gain does not
persist when inter-node communication is switched on.
In both cases the same amount of data has
to be transferred between the nodes. But in the
pure MPI version more calls are needed to copy
data between nodes and, in addition, data has to
be copied within the nodes. Both should lead to
a larger communication overhead for the pure MPI version.

According to the hardware performance moni- 
tor we are running at about 400-450 Mflop/s per 
CPU in our production runs. This is in accor- 
dance with the 433 Mflop/s given in Figure 1. 

3. LIGHT HADRON MASSES

3.1. Simulation details 

We investigate full QCD with the standard
Wilson action for the plaquette and the fermions
plus the $O(a)$ improvement term. For the
improvement coefficient $c_{sw}(\beta)$ we use the
non-perturbative values determined by the ALPHA
collaboration [4].

Simulation parameters and statistics are listed
in Table 2. In these simulations the $m_{PS}/m_V$
ratio lies between 0.60 and 0.83. The Sommer
scale $r_0/a$ varies between 4.7 and 5.5.



Table 2

         $\beta$   $\kappa_{sea}$   $V$                 statistics
UKQCD    5.20      0.1350           $16^3 \times 32$    6000
                   0.1355           $16^3 \times 32$    8000
         5.26      0.1345           $16^3 \times 32$    4000
         5.29      0.1340           $16^3 \times 32$    4000
QCDSF    5.29      0.1350           $16^3 \times 32$    2270
                   0.1355           $24^3 \times 48$     605



Figure 2 shows the integrated autocorrelation
time of the plaquette, $\tau_{int,P}(t) = \frac{1}{2} + \sum_{t'=1}^{t} \rho_P(t')$
($\rho_P$ is the normalized autocorrelation function),
for the QCDSF run on the $16^3 \times 32$ lattice. From
the plot we read off that in this case $\tau_{int,P} \approx 10$.
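
The definition of the integrated autocorrelation time can be made concrete with a short sketch. It uses a simple estimator of the normalized autocorrelation function and a synthetic time series standing in for the plaquette history; the estimator details, the function names and the test data are assumptions made for illustration, not the analysis code used for Figure 2.

```python
import random


def normalized_autocorrelation(series, t):
    """Estimate rho(t) of a time series (simple estimator, rho(0) = 1)."""
    n = len(series)
    if not 0 <= t < n:
        raise ValueError("lag t must satisfy 0 <= t < len(series)")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series) / n
    if variance == 0.0:
        raise ValueError("series has zero variance")
    covariance = sum((series[i] - mean) * (series[i + t] - mean)
                     for i in range(n - t)) / (n - t)
    return covariance / variance


def integrated_autocorrelation_time(series, t_max):
    """tau_int(t_max) = 1/2 + sum_{t'=1}^{t_max} rho(t')."""
    return 0.5 + sum(normalized_autocorrelation(series, tp)
                     for tp in range(1, t_max + 1))


# Synthetic, correlated stand-in for a plaquette history (hypothetical data).
rng = random.Random(1)
history, x = [], 0.0
for _ in range(2000):
    x = 0.8 * x + rng.gauss(0.0, 1.0)
    history.append(x)

for t_max in (1, 5, 20):
    print(t_max, integrated_autocorrelation_time(history, t_max))
```

In practice one plots the result as a function of the upper summation limit, as in Figure 2, and reads off the value where the curve reaches a plateau.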

3.2. Results 

Figure 3 shows an APE plot. For clarity, error
bars for the quenched results (0.1-3.5 %) are
not shown. One sees that for currently reachable
sea quark masses unquenching has, if any, only
small effects on the light hadron spectrum, and
that with dynamical fermions it is a long way to
$m_{PS}/m_V$ ratios as small as in the quenched case.

In the following we compare our dynamical
results with chiral extrapolations done in the
quenched case at $\beta = 6.0$ where $r_0/a = 5.36$.
This is within the range of $r_0/a$ of the dynamical
runs studied. Let us first recall the chiral
extrapolation of $m_V$. Our quenched results for $m_V$ are
very well described by a phenomenological fit

$(a m_V)^2 = (a M_V)^2 + b_2 (a m_{PS})^2 + b_3 (a m_{PS})^3$    (3)

Figure 2. Integrated autocorrelation time of the
plaquette for the run at $\beta = 5.29$ and $\kappa = 0.1350$.

Figure 3. APE plot. Open symbols represent our
quenched results without error bars. Filled
symbols are our dynamical results.

where $b_2 = 0.910(20)$ and $b_3 = 0.049(15)$. The
lightest three points were excluded from the fit.
The deviation of these points is believed to be
a quenching artifact. A quenching artifact
appearing in quenched chiral perturbation theory
for the vector meson mass is a term $\propto a m_{PS}$.
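
For orientation, the phenomenological fit (3) can be evaluated as in the sketch below. The coefficients are the central values quoted above. The chiral-limit value $a M_V$ is not given in the text, so the number used here is purely hypothetical, and the power of the last term is taken to be 3 (matching the index of $b_3$), which should be treated as an assumption.

```python
B2 = 0.910  # central value quoted in the text
B3 = 0.049  # central value quoted in the text
LAST_TERM_POWER = 3  # assumption: power of the b_3 term


def am_v_squared(am_ps, a_mv_chiral, b2=B2, b3=B3, power=LAST_TERM_POWER):
    """(a m_V)^2 from the phenomenological fit form of eq. (3)."""
    return a_mv_chiral ** 2 + b2 * am_ps ** 2 + b3 * am_ps ** power


HYPOTHETICAL_A_MV = 0.33  # placeholder only, NOT a value from the paper

for am_ps in (0.0, 0.2, 0.4):
    am_v = am_v_squared(am_ps, HYPOTHETICAL_A_MV) ** 0.5
    print(f"a m_PS = {am_ps:.1f}  ->  a m_V = {am_v:.4f}")
```

A term linear in $a m_{PS}$, as expected from quenched chiral perturbation theory, is absent from this form, which is why deviations at the lightest points are of interest.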

Figure 4. Comparison of quenched data and fit
(3) with the dynamical results. The star indicates
the experimental value.

Figure 5. Comparison of $m_N$ quenched data and
fit as in Figure 4 with the dynamical results.
The star indicates the experimental value.



In Figure 4 we plot our quenched results together
with the dynamical ones. The line represents
the fit (3). The fit range is much larger
than the range of this plot, which only includes the
six smallest masses (open symbols). Towards the
chiral limit we see that the dynamical data (filled
symbols) lie systematically below the quenched
data. The dynamical results are consistent with
the quenched fit.

The corresponding picture for $m_N$ is shown in
Figure 5. The open symbols represent quenched
results and the line is a phenomenological fit like
(3), where the curvature $b_3$ is again small. The
filled symbols are dynamical results. The agreement
between quenched and dynamical results is
more pronounced than for $m_V$.

3.3. Conclusions 

We observe that the results with
non-perturbatively $O(a)$ improved dynamical
fermions scale well. Although we are not yet able
to do a continuum extrapolation, we see no
large discretization errors. In the considered
range of sea quark masses we found no clear
evidence for unquenching effects in comparison with
previous quenched results.

However, our results seem to indicate that the
vector meson mass $m_V$ approaches the chiral limit
in a different way. This may be another hint that
quenching artifacts are visible in the quenched
data.